161 research outputs found

Inherent limitations of probabilistic models for protein-DNA binding specificity

Author: Ruan Shuxiang
Stormo Gary D
Publication venue: Digital Commons@Becker
Publication date: 01/01/2017

The specificities of transcription factors are most commonly represented with probabilistic models. These models provide a probability for each base occurring at each position within the binding site, and the positions are assumed to contribute independently. The model is simple and intuitive and is the basis for many motif discovery algorithms. However, the model also has inherent limitations that prevent it from accurately representing true binding probabilities, especially for the highest affinity sites under conditions of high protein concentration. The limitations are not due to the assumption of independence between positions but rather are caused by the non-linear relationship between binding affinity and binding probability and the fact that independent normalization at each position skews the site probabilities. Generally, probabilistic models are reasonably good approximations, but new high-throughput methods allow for biophysical models with increased accuracy that should be used whenever possible.
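The non-linear relationship between affinity and binding probability described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the matrix values, the choice of energy as -ln(p) in units of kT, and the two chemical potentials `mu` are all assumptions chosen for illustration. The occupancy formula 1 / (1 + exp(E - mu)) is the standard two-state binding expression.

```python
import math

# Hypothetical per-position base probabilities for a 3-bp site (each column sums to 1).
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]


def site_probability(site, pwm=PWM):
    """Probabilistic model: product of independent per-position probabilities."""
    p = 1.0
    for base, column in zip(site, pwm):
        p *= column[base]
    return p


def binding_energy(site, pwm=PWM):
    """Additive energy in kT, taken here as -ln(p) per position (an assumption)."""
    return -sum(math.log(column[base]) for base, column in zip(site, pwm))


def binding_probability(site, mu, pwm=PWM):
    """Two-state occupancy; mu (in kT) rises with protein concentration."""
    return 1.0 / (1.0 + math.exp(binding_energy(site, pwm) - mu))


best, weaker = "ACG", "TCG"

# Ratio predicted by the probabilistic model (independent of concentration).
ratio_pwm = site_probability(best) / site_probability(weaker)

# Occupancy ratios at low and high protein concentration (hypothetical mu values).
ratio_low = binding_probability(best, mu=-5.0) / binding_probability(weaker, mu=-5.0)
ratio_high = binding_probability(best, mu=5.0) / binding_probability(weaker, mu=5.0)

print(ratio_pwm, ratio_low, ratio_high)
```

Under these assumptions the occupancy ratio stays close to the probabilistic ratio at low concentration, but shrinks toward 1 at high concentration because the best site approaches saturation.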

Directory of Open Access Journals

Digital Commons@Becker

FigShare

Quantitative profiling of BATF family proteins/JUNB/IRF hetero-trimers using Spec-seq

Author: Chang Yiming K
Stormo Gary D
Zuo Zheng
Publication venue: Digital Commons@Becker
Publication date: 01/01/2018

Additional file 6. Half site analysis for BATFx-JUNB-IRFx. (A) Single variants from half sites of these oligos in the library were used to generate energy logos. Bolded positions represent the half sites generated in B. (B) Energy logos from Spec-seq results of BATFx-JUNB-IRFx. The Y-axis is negative energy so the preferred sequence is on the top

Directory of Open Access Journals

Digital Commons@Becker

FigShare

Combining SELEX with quantitative assays to rapidly obtain accurate models of protein–DNA interactions

Author: Liu Jiajian
Stormo Gary D.
Publication venue: Oxford University Press
Publication date: 01/01/2005

Models for the specificity of DNA-binding transcription factors are often based on small amounts of qualitative data and therefore have limited accuracy. In this study we demonstrate a simple and efficient method of affinity chromatography-SELEX followed by a quantitative binding (QuMFRA) assay to rapidly collect the data necessary for more accurate models. Using the zinc finger protein EGR as an example, we show that many binding sites can be obtained efficiently with affinity chromatography-SELEX, but those sequences alone provide a weight matrix model with limited accuracy. Using a QuMFRA assay to determine the quantitative relative affinity for only a subset of the sequences obtained by SELEX leads to a much more accurate model. Application of this method to variants of a transcription factor would allow us to generate a large collection of quantitative data for modeling protein–DNA interactions that could facilitate the determination of recognition codes for different transcription factor families.

Crossref

PubMed Central

Digital Commons@Becker

Quantitative analysis of EGR proteins binding to DNA: assessing additivity in both the binding site and the protein

Author: Liu Jiajian
Stormo Gary D
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: Recognition codes for protein-DNA interactions typically assume that the interacting positions contribute additively to the binding energy. While this is known to not be precisely true, an additive model over the DNA positions can be a good approximation, at least for some proteins. Much less information is available about whether the protein positions contribute additively to the interaction. RESULTS: Using EGR zinc finger proteins, we measure the binding affinity of six different variants of the protein to each of six different variants of the consensus binding site. Both the protein and binding site variants include single and double mutations that allow us to assess how well additive models can account for the data. For each protein and DNA alone we find that additive models are good approximations, but over the combined set of data there are context effects that limit their accuracy. However, a small modification to the purely additive model, with only three additional parameters, improves the fit significantly. CONCLUSION: The additive model holds very well for every DNA site and every protein included in this study, but clear context dependence in the interactions was detected. A simple modification to the independent model provides a better fit to the complete data
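The additivity test described in this abstract can be sketched as follows. This is an illustrative assumption, not the study's analysis: the function names and the energy values are hypothetical, and the paper's three-parameter modification is not reproduced here. Under a purely additive model, the energy change of a double mutant equals the sum of the changes for the two single mutants, and any residual is a context effect.

```python
def additive_prediction(ddg_single_a, ddg_single_b):
    """Additive model: the double-mutant energy change is the sum of the single changes."""
    return ddg_single_a + ddg_single_b


def context_effect(ddg_double_observed, ddg_single_a, ddg_single_b):
    """Deviation of the observed double-mutant energy from the additive prediction."""
    return ddg_double_observed - additive_prediction(ddg_single_a, ddg_single_b)


# Hypothetical energy changes (arbitrary units) relative to the consensus pair.
single_a = 0.8   # mutation at one position alone
single_b = 1.0   # mutation at another position alone
double_ab = 2.0  # both mutations together

residual = context_effect(double_ab, single_a, single_b)
print(residual)  # a value near zero would indicate the additive model holds
```

A residual that is large compared with the measurement error would point to context dependence between the two positions.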

Springer - Publisher Connector

PubMed Central

Digital Commons@Becker

Assessing the effects of symmetry on motif discovery and modeling

Author: Motlhabi Lala M
Stormo Gary D
Publication venue: Digital Commons@Becker
Publication date: 01/01/2011

BACKGROUND: Identifying the DNA binding sites for transcription factors is a key task in modeling the gene regulatory network of a cell. Predicting DNA binding sites computationally suffers from high false positives and false negatives due to various contributing factors, including the inaccurate models for transcription factor specificity. One source of inaccuracy in the specificity models is the assumption of asymmetry for symmetric models. METHODOLOGY/PRINCIPAL FINDINGS: Using simulation studies, so that the correct binding site model is known and various parameters of the process can be systematically controlled, we test different motif finding algorithms on both symmetric and asymmetric binding site data. We show that if the true binding site is asymmetric the results are unambiguous and the asymmetric model is clearly superior to the symmetric model. But if the true binding specificity is symmetric commonly used methods can infer, incorrectly, that the motif is asymmetric. The resulting inaccurate motifs lead to lower sensitivity and specificity than would the correct, symmetric models. We also show how the correct model can be obtained by the use of appropriate measures of statistical significance. CONCLUSIONS/SIGNIFICANCE: This study demonstrates that the most commonly used motif-finding approaches usually model symmetric motifs incorrectly, which leads to higher than necessary false prediction errors. It also demonstrates how alternative motif-finding methods can correct the problem, providing more accurate motif models and reducing the errors. Furthermore, it provides criteria for determining whether a symmetric or asymmetric model is the most appropriate for any experimental dataset
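What a symmetric motif model means in practice can be shown with a short sketch. It is an assumption for illustration, not the method evaluated in the study: a matrix is treated as symmetric when it equals its own reverse complement, and averaging a matrix with its reverse complement is one simple way to impose that constraint. The matrix values are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def reverse_complement(pwm):
    """Reverse the column order and swap each base for its complement."""
    return [{COMPLEMENT[b]: p for b, p in col.items()} for col in reversed(pwm)]


def is_symmetric(pwm, tol=1e-9):
    """A motif is symmetric if it matches its own reverse complement."""
    rc = reverse_complement(pwm)
    return all(
        abs(col[b] - rc_col[b]) <= tol
        for col, rc_col in zip(pwm, rc)
        for b in BASES
    )


def symmetrize(pwm):
    """Average the matrix with its reverse complement (one simple symmetric constraint)."""
    rc = reverse_complement(pwm)
    return [
        {b: (col[b] + rc_col[b]) / 2 for b in BASES}
        for col, rc_col in zip(pwm, rc)
    ]


# Hypothetical asymmetric 2-bp motif; each column sums to 1.
motif = [
    {"A": 0.8, "C": 0.1, "G": 0.05, "T": 0.05},
    {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4},
]

symmetric_motif = symmetrize(motif)
print(is_symmetric(motif), is_symmetric(symmetric_motif))
```

Averaging halves the number of free parameters, which is one reason a symmetric model can be estimated more reliably when the true specificity is symmetric.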

Directory of Open Access Journals

Digital Commons@Becker

PubMed Central

Finding motifs using DNA images derived from sparse representations

Author: Chu Shane K
Stormo Gary D
Publication venue: Oxford University Press (OUP)
Publication date: 01/06/2023

MOTIVATION: Motifs play a crucial role in computational biology, as they provide valuable information about the binding specificity of proteins. However, conventional motif discovery methods typically rely on simple combinatoric or probabilistic approaches, which can be biased by heuristics such as substring-masking for multiple motif discovery. In recent years, deep neural networks have become increasingly popular for motif discovery, as they are capable of capturing complex patterns in data. Nonetheless, inferring motifs from neural networks remains a challenging problem, both from a modeling and computational standpoint, despite the success of these networks in supervised learning tasks. RESULTS: We present a principled representation learning approach based on a hierarchical sparse representation for motif discovery. Our method effectively discovers gapped, long, or overlapping motifs that we show to commonly exist in next-generation sequencing datasets, in addition to the short and enriched primary binding sites. Our model is fully interpretable, fast, and capable of capturing motifs in a large number of DNA strings. A key concept that emerged from our approach, enumerating at the image level, effectively overcomes the k-mers paradigm, enabling modest computational resources for capturing the long and varied but conserved patterns, in addition to capturing the primary binding sites. AVAILABILITY AND IMPLEMENTATION: Our method is available as a Julia package under the MIT license at https://github.com/kchu25/MOTIFs.jl, and the results on experimental data can be found at https://zenodo.org/record/7783033

Digital Commons@Becker

Discovering structural cis-regulatory elements by modeling the behaviors of mRNAs

Author: Foat Barrett C
Stormo Gary D
Publication venue: Nature Publishing Group
Publication date: 01/01/2009

Gene expression is regulated at each step from chromatin remodeling through translation and degradation. Several known RNA-binding regulatory proteins interact with specific RNA secondary structures in addition to specific nucleotides. To provide a more comprehensive understanding of the regulation of gene expression, we developed an integrative computational approach that leverages functional genomics data and nucleotide sequences to discover RNA secondary structure-defined cis-regulatory elements (SCREs). We applied our structural cis-regulatory element detector (StructRED) to microarray and mRNA sequence data from Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens. We recovered the known specificities of Vts1p in yeast and Smaug in flies. In addition, we discovered six putative SCREs in flies and three in humans. We characterized the SCREs based on their condition-specific regulatory influences, the annotation of the transcripts that contain them, and their locations within transcripts. Overall, we show that modeling functional genomics data in terms of combined RNA structure and sequence motifs is an effective method for discovering the specificities and regulatory roles of RNA-binding proteins

Crossref

PubMed Central

Digital Commons@Becker

Measuring quantitative effects of methylation on transcription factor-DNA binding affinity

Author: Chang Yiming Kenny
Granas David
Roy Basab
Stormo Gary D
Zuo Zheng
Publication venue: Digital Commons@Becker
Publication date: 01/01/2017

Digital Commons@Becker

Making connections between novel transcription factors and their DNA motifs

Author: McCue Lee Ann
Stormo Gary D.
Tan Kai
Publication venue: Digital Commons@Becker
Publication date: 01/01/2005

The key components of a transcriptional regulatory network are the connections between trans-acting transcription factors and cis-acting DNA-binding sites. In spite of several decades of intense research, only a fraction of the estimated ∼300 transcription factors in Escherichia coli have been linked to some of their binding sites in the genome. In this paper, we present a computational method to connect novel transcription factors and DNA motifs in E. coli. Our method uses three types of mutually independent information, two of which are gleaned by comparative analysis of multiple genomes and the third one derived from similarities of transcription-factor-DNA-binding-site interactions. The different types of information are combined to calculate the probability of a given transcription-factor-DNA-motif pair being a true pair. Tested on a study set of transcription factors and their DNA motifs, our method has a prediction accuracy of 59% for the top predictions and 85% for the top three predictions. When applied to 99 novel transcription factors and 70 novel DNA motifs, our method predicted 64 transcription-factor-DNA-motif pairs. Supporting evidence for some of the predicted pairs is presented. Functional annotations are made for 23 novel transcription factors based on the predicted transcription-factor-DNA-motif connections

Digital Commons@Becker

PubMed Central

Fast, sensitive discovery of conserved genome-wide motifs

Author: Buhler Jeremy
Ihuegbu Nnamdi E
Stormo Gary D
Publication venue: Digital Commons@Becker
Publication date: 01/01/2012

Regulatory sites that control gene expression are essential to the proper functioning of cells, and identifying them is critical for modeling regulatory networks. We have developed Magma (Multiple Aligner of Genomic Multiple Alignments), a software tool for multiple species, multiple gene motif discovery. Magma identifies putative regulatory sites that are conserved across multiple species and occur near multiple genes throughout a reference genome. Magma takes as input multiple alignments that can include gaps. It uses efficient clustering methods that make it about 70 times faster than PhyloNet, a previous program for this task, with slightly greater sensitivity. We ran Magma on all non-coding DNA conserved between Caenorhabditis elegans and five additional species, about 70 Mbp in total, in <4 h. We obtained 2,309 motifs with lengths of 6–20 bp, each occurring at least 10 times throughout the genome, which collectively covered about 566 kbp of the genomes, approximately 0.8% of the input. Predicted sites occurred in all types of non-coding sequence but were especially enriched in the promoter regions. Comparisons to several experimental datasets show that Magma motifs correspond to a variety of known regulatory motifs

Digital Commons@Becker

PubMed Central